- #!/bin/sh -xe
- # README.linux.words - file used to create linux.words
- # Created: Wed Mar 10 09:12:49 1993 by faith@cs.unc.edu (Rik Faith)
- # Revised: Sat Mar 13 17:02:08 1993 by faith@cs.unc.edu
- #
- # Care was taken to be sure that the linux.words list was free of
- # copyright. This makes linux.words a suitable /usr/dict/words
- # replacement for the Linux community.
- #
- # Since the majority of the words are from Tanenbaum's minix.dict file,
- # the notice from Barry Brachman, included below, should accompany any
- # redistribution of this list.
-
- # Here is a detailed explanation of how I created the linux.words file.
- #
- # This README.linux.words file is actually a shell script that you can use to
- # recreate the linux.words file from original sources.
- #
- # First, I started with minix.dict
- # from cs.ubc.ca:/pub/local/src/sp-1.5/wordlists-1.0.tar.Z
- #
- # The following is from the NOTES file in wordlists-1.0.tar.Z:
-
- # NOTES> These word lists were collected by Barry Brachman
- # NOTES> <brachman@cs.ubc.ca> at the University of British Columbia. They
- # NOTES> may be freely distributed as long as this notice accompanies them.
- # NOTES>
- # NOTES> ==================================================================
- # NOTES> Info for minix.dict:
- # NOTES>
- # NOTES> Article 1997 of comp.os.minix:
- # NOTES> From: ast@botter.UUCP
- # NOTES> Subject: A spelling checker for MINIX
- # NOTES> Date: 6 Jan 88 22:28:22 GMT
- # NOTES> Reply-To: ast@cs.vu.nl (Andy Tanenbaum)
- # NOTES> Organization: VU Informatica, Amsterdam
- # NOTES>
- # NOTES> This dictionary is NOT based on the UNIX dictionary so it is free
- # NOTES> of AT&T copyright. I built the dictionary from three sources.
- # NOTES> First, I started by sorting and uniq'ing some public domain
- # NOTES> dictionaries. Second, as some of you probably know, I have
- # NOTES> written somewhere between 3 and 6 books (depending on precisely
- # NOTES> what you count) and an additional 50 published papers on operating
- # NOTES> systems, networks, compilers, languages, etc. This data base,
- # NOTES> which is online, is nonnegligible :-) Finally, I added a number of
- # NOTES> words that I thought ought to be in the dictionary including all
- # NOTES> the U.S. states, all the European and some other major countries,
- # NOTES> principal U.S. and world cities, and a bunch of technical terms.
- # NOTES> I don't want my spelling checker to barf on arpanet, diskless,
- # NOTES> modem, login, internetwork, subdirectory, superuser, vlsi, or
- # NOTES> winchester just because Webster wouldn't approve of them. All in
- # NOTES> all, the dictionary is over 40,000 words. If you have any
- # NOTES> suggestions for additions or deletions, please post them. But
- # NOTES> please be sure you are not infringing on anyone's copyright in
- # NOTES> doing so.
- # NOTES>
- # NOTES> Andy Tanenbaum (ast@cs.vu.nl)
-
- # The main problem with minix.dict is that many proper names are not
- # capitalized. So, I got english.tar.Z from ftp.uu.net:/doc/dictionaries,
- # which is a mirror of nic.funet.fi:/pub/unix/security/dictionaries.
- #
- # Here is part of the README file for english.tar.Z:
-
- # README>
- # README> FILE: english.words
- # README> VERSION: DEC-SRC-92-04-05
- # README>
- # README> EDITOR
- # README>
- # README> Jorge Stolfi <stolfi@src.dec.com>
- # README> DEC Systems Research Center
- # README>
- # README> AUTHORS OF ORIGINAL WORDLISTS
- # README>
- # README> Andy Tanenbaum <ast@cs.vu.nl>
- # README> Barry Brachman <brachman@cs.ubc.ca>
- # README> Geoff Kuenning <geoff@itcorp.com>
- # README> Henk Smit <henk@cs.vu.nl>
- # README> Walt Buehring <buehring%ti-csl@csnet-relay>
- #
- # [stuff deleted]
- #
- # README> AUXILIARY LISTS
- # README>
- # README> In the same directory as english.words there are a few
- # README> complementary word lists, all derived from the same sources
- # README> [1--8] as the main list:
- # README>
- # README> english.names
- # README>
- # README> A list of common English proper names and their derivatives.
- # README> The list includes: person names ("John", "Abigail",
- # README> "Barrymore"); countries, nations, and cities ("Germany",
- # README> "Gypsies", "Moscow"); historical, biblical and mythological
- # README> figures ("Columbus", "Isaiah", "Ulysses"); important
- # README> trademarked products ("Xerox", "Teflon"); biological genera
- # README> ("Aerobacter"); and some of their derivatives ("Germans",
- # README> "Xeroxed", "Newtonian").
- # README>
- # README> misc.names
- # README>
- # README> A list of foreign-sounding names of persons and places
- # README> ("Antonio", "Albuquerque", "Balzac", "Stravinski"), extracted
- # README> from the lists [1--8]. (The distinction between
- # README> "English-sounding" and "foreign-sounding" is of course rather
- # README> arbitrary).
- # README>
- # README> org.names
- # README>
- # README> A short list of names of corporations and other institutions
- # README> ("Pepsico", "Amtrak", "Medicare"), and a few derivatives.
- # README>
- # README> The file also includes some initialisms --- acronyms and
- # README> abbreviations that are generally pronounced as words rather
- # README> than spelled out ("NASA", "UNESCO").
- # README>
- # README> english.abbrs
- # README>
- # README> A list of common abbreviations ("etc.", "Dr.", "Wed."),
- # README> acronyms ("A&M", "CPU", "IEEE"), and measurement symbols
- # README> ("ft", "cm", "ns", "kHz").
- # README>
- # README> english.trash
- # README>
- # README> A list of words from the original wordlists
- # README> that I decided were either wrong or unsuitable for inclusion
- # README> in the file english.words or any of the other auxiliary
- # README> lists. It includes
- # README>
- # README> typos ("accupy", "aquariia", "automatontons")
- # README> spelling errors ("abcissa", "alleviater", "analagous")
- # README> bogus derived forms ("homeown", "unfavorablies", "catched")
- # README> uncapitalized proper names ("afghanistan",
- # README> "algol", "decnet")
- # README> uncapitalized acronyms ("apl", "ccw", "ibm")
- # README> unpunctuated abbreviations ("amp", "approx", "etc")
- # README> British spellings ("advertize", "archaeology")
- # README> archaic words ("bedight")
- # README> rare variants ("babirousa")
- # README> unassimilated foreign words ("bambino", "oui", "caballero")
- # README> mis-hyphenated compounds ("babylike", "backarrows")
- # README> computer keywords and slang ("lconvert", "noecho", "prog")
- # README>
- # README> (I apologize for excluding British spellings. I should have
- # README> split the list in three sublists--- common English, British,
- # README> American---as ispell does. But there are only so many hours
- # README> in a day...)
- # README>
- # README> english.maybe
- # README>
- # README> A list of about 5,000 lowercase words from the "mts.dict"
- # README> wordlist [6] that weren't included in english.words.
- # README>
- # README> This list seems to include lots of "trash", like
- # README> uncapitalized proper names and weird words. It would
- # README> take me several days to sort this mess, so I decided to
- # README> leave it as a separate file. Use at your own risk...
- #
- # [stuff deleted]
- #
- # README> (NON-)COPYRIGHT STATUS
- # README>
- # README> To the best of my knowledge, all the files I used to build these
- # README> wordlists were available for public distribution and use, at least
- # README> for non-commercial purposes. I have confirmed this assumption with
- # README> the authors of the lists, whenever they were known.
- # README>
- # README> Therefore, it is safe to assume that the wordlists in this
- # README> package can also be freely copied, distributed, modified, and
- # README> used for personal, educational, and research purposes. (Use of
- # README> these files in commercial products may require written
- # README> permission from DEC and/or the authors of the original lists.)
- # README>
- # README> Whenever you distribute any of these wordlists, please distribute
- # README> also the accompanying README file. If you distribute a modified
- # README> copy of one of these wordlists, please include the original README
- # README> file with a note explaining your modifications. Your users will
- # README> surely appreciate that.
- # README>
- # README> (NO-)WARRANTY DISCLAIMER
- # README>
- # README> These files, like the original wordlists on which they are
- # README> based, are still very incomplete, uneven, and inconsistent, and
- # README> probably contain many errors. They are offered "as is" without
- # README> any warranty of correctness or fitness for any particular
- # README> purpose. Neither I nor my employer can be held responsible for
- # README> any losses or damages that may result from their use.
-
- # subtract english.trash
- cat minix.dict english.trash english.trash | sort | uniq -u > dict.1
- # subtract english.maybe
- cat dict.1 english.maybe english.maybe | sort | uniq -u > dict.2
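-
- # The two subtraction steps above rely on a sort/uniq idiom. What follows
- # is a minimal sketch on made-up data; words.txt and trash.txt are
- # hypothetical stand-ins, not files from the original sources. Listing
- # the list to subtract twice makes each of its words occur at least
- # twice, so "uniq -u", which prints only lines that occur exactly once,
- # drops them. The sketch assumes the first list has no internal
- # duplicates and that "mktemp -d" is available.

```shell
#!/bin/sh
# Sketch of "A minus B" with sort | uniq -u, on hypothetical sample data.
set -e
tmp=$(mktemp -d)

# words.txt plays the role of minix.dict, trash.txt that of english.trash.
printf '%s\n' apple accupy banana catched > "$tmp/words.txt"
printf '%s\n' accupy catched zzz > "$tmp/trash.txt"

# trash.txt is listed twice, so each of its words occurs at least twice
# and is dropped by uniq -u, whether or not it also appears in words.txt.
result=$(cat "$tmp/words.txt" "$tmp/trash.txt" "$tmp/trash.txt" | sort | uniq -u)
printf '%s\n' "$result"

rm -r "$tmp"
```

- # With this sample data only "apple" and "banana" survive.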
-
- # build subtraction list of proper names and abbreviations
- cat english.names misc.names org.names computer.names english.abbrs > sub.1
- tr 'A-Z' 'a-z' < sub.1 | sort | uniq -u > sub.2
-
- # subtract proper names with incorrect capitalization
- cat dict.2 sub.2 sub.2 | sort | uniq -u > dict.3
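-
- # A note on how sub.2 is built above, sketched with hypothetical sample
- # names. After lowercasing, "uniq -u" keeps only entries that occur
- # exactly once, so an entry that shows up more than once (for example
- # in two of the name lists, or in two capitalizations) does not reach
- # the subtraction list at all. "sort -u" would keep one copy of each.
- # Whether this difference matters for the real lists is not something
- # this sketch can tell.

```shell
#!/bin/sh
# Hypothetical sample: "Bill" and "bill" collide once lowercased.
names=$(printf '%s\n' Bill bill Xerox)

# uniq -u prints only lines that occur exactly once.
once=$(printf '%s\n' "$names" | tr 'A-Z' 'a-z' | sort | uniq -u)

# sort -u prints one copy of every distinct line.
distinct=$(printf '%s\n' "$names" | tr 'A-Z' 'a-z' | sort -u)

# Unquoted expansion joins the lines with spaces for compact display.
printf 'uniq -u: %s\n' "$(echo $once)"
printf 'sort -u: %s\n' "$(echo $distinct)"
```

- # Here "uniq -u" yields only "xerox", while "sort -u" yields "bill" and
- # "xerox".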
-
- # build proper name list without possessives
- cat english.names misc.names org.names computer.names | fgrep -v \'s > names.1
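-
- # The possessive filter above can be tried on a few made-up names.
- # fgrep is the historical name for "grep -F" (fixed-string matching);
- # recent GNU grep releases warn that fgrep is obsolescent, so the
- # sketch below uses "grep -F", which should behave the same way here.
- # The sample names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample names, two of them possessives.
names=$(printf '%s\n' "Xerox" "Xerox's" "O'Brien" "Columbus's")

# -F treats the pattern as a fixed string, -v inverts the match.
# Any line containing 's anywhere is dropped, not only at the end.
kept=$(printf '%s\n' "$names" | grep -F -v "'s")
printf '%s\n' "$kept"
```

- # "Xerox" and "O'Brien" are kept; the two possessives are dropped.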
-
- # add in proper names (use sort twice to get uppercase before lowercase)
- cat dict.3 names.1 | sort | sort -df | uniq > linux.words
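-
- # The final merge above can be sketched on a handful of made-up words.
- # "sort -d" compares only blanks and alphanumerics and "-f" folds case,
- # so "Apple" and "apple" compare equal and end up side by side. Which
- # of the two comes first depends on the sort implementation and the
- # locale, so LC_ALL=C is set here as an assumption, not something the
- # original script does. The plain "uniq" that follows is case-sensitive
- # and removes only exact duplicates.

```shell
#!/bin/sh
# Hypothetical sample input with mixed case and one exact duplicate.
words=$(printf '%s\n' bill apple Bill Apple apple)

# Same pipeline shape as the script: plain sort, then sort -df, then uniq.
# LC_ALL=C is an assumption made for this sketch only.
merged=$(printf '%s\n' "$words" | LC_ALL=C sort | LC_ALL=C sort -df | uniq)
printf '%s\n' "$merged"

# Case-folded view, used to check that the two capitalizations of each
# word sit next to each other.
folded=$(printf '%s\n' "$merged" | tr 'A-Z' 'a-z')
```

- # Four lines remain: both capitalizations of "apple" and of "bill",
- # with the duplicate "apple" removed.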
-
- # clean up
- rm dict.[123] sub.[12] names.1
-